Statistical Machine Translation: Robust parameter estimation from noisy corpus
Abstract
In this report, we describe our study of the effect of noise on parameter estimation for statistical machine translation. So far, no study has been done on this topic, even though the EM algorithm, which is used for parameter estimation in statistical machine translation, is known to be highly sensitive to noise. We present in detail the experiments performed to observe the influence of noise on parameter estimation, and the various methods investigated to counter this effect.
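As a rough illustration of why EM-based estimation can react to noise, the sketch below trains a word-translation table with EM on a tiny corpus, once clean and once with a single misaligned sentence pair added. It assumes IBM Model 1 without a NULL word, which the abstract does not specify; the function name train_ibm1 and the toy sentences are hypothetical and are not taken from the report.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(f | e) with EM from (source_tokens, target_tokens) pairs.

    Simplified IBM Model 1 (no NULL word); an illustrative assumption,
    not the model used in the report.
    """
    src_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialisation of t(f | e).
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e
        for fs, es in pairs:
            for f in fs:
                # E-step: posterior over which target word generated f.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise the expected counts per target word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Toy clean corpus (hypothetical data).
clean = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
# One misaligned sentence pair stands in for corpus noise.
noisy = clean + [(["ein", "buch"], ["the", "house"])]

t_clean = train_ibm1(clean)
t_noisy = train_ibm1(noisy)

print("clean t(ein | the):", t_clean[("ein", "the")])
print("noisy t(ein | the):", t_noisy[("ein", "the")])
```

In the clean run, "ein" never co-occurs with "the", so that entry stays at zero. In the noisy run the misaligned pair gives it a nonzero estimate, because EM must explain every sentence pair it is given, including the wrong ones.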
Similar resources
Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation
In this paper, we present two methods to use a noisy parallel news corpus to improve statistical machine translation (SMT) systems. Taking full advantage of the characteristics of our corpus and of existing resources, we use a bootstrapping strategy, whereby an existing SMT engine is used both to detect parallel sentences in comparable data and to provide an adaptation corpus for translation mo...
Discriminative Corpus Weight Estimation for Machine Translation
Current statistical machine translation (SMT) systems are trained on sentence-aligned and word-aligned parallel text collected from various sources. Translation model parameters are estimated from the word alignments, and the quality of the translations on a given test set depends on the parameter estimates. There are at least two factors affecting the parameter estimation: domain match and trai...
Using Noisy Bilingual Data for Statistical Machine Translation
SMT systems rely on a sufficient amount of parallel corpora to train the translation model. This paper investigates possibilities to use word-to-word and phrase-to-phrase translations extracted not only from clean parallel corpora but also from noisy comparable corpora. Translation results for a Chinese-to-English translation task are given.
Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora
The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate an...
Robust Estimation of Feature Weights in Statistical Machine Translation
Weights of the various components in a standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training. With this, one finds their optimum value on a development set with the expectation that these optimal weights generalise well to other test sets. However, this is not always the case when domains differ. This work uses a perceptron algorithm to learn more ...